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Abstract 



The Gibbs sampler is one of the most popular algorithms for inference in statistical 
models. In this paper, we introduce a herding variant of this algorithm, called 
herded Gibbs, that is entirely deterministic. We prove that herded Gibbs has an 
0{1/T) convergence rate for models with independent variables and for fully 
connected probabilistic graphical models. Herded Gibbs is shown to outperform 
Gibbs in the tasks of image denoising with MRFs and named entity recognition 
with CRFs. However, the convergence for herded Gibbs for sparsely connected 
probabilistic graphical models is still an open problem. [] 

1 Introduction 

Over the last 60 years, we have witnessed great progress in the design of randomized sampling 
algorithms; see for example lfT6l l9l[3 22 | and the references therein. In contrast, the design of deter- 
ministic algorithms for "sampling" from distributions is still in its inception |[8][T3]|7]|20l. There are, 
however, many important reasons for pursuing this line of attack on the problem. From a theoreti- 
cal perspective, this is a well defined mathematical challenge whose solution might have important 
consequences. It also brings us closer to reconciling the fact that we typically use pseudo-random 
number generators to run Monte Carlo algorithms on classical. Von Neumann architecture, comput- 
ers. Moreover, the theory for some of the recently proposed deterministic sampling algorithms has 
taught us that they can achieve 0(1/T) convergence rates ||8l lT3]| , which are much faster than the 
standard Monte Carlo rates of 0(1/ ^/T) for computing ergodic averages. From a practical perspec- 
tive, the design of deterministic sampling algorithms creates an opportunity for researchers to apply 
a great body of knowledge on optimization to the problem of sampling; see for example ID for an 
early example of this. 

The domain of application of currently existing deterministic sampling algorithms is still very nar- 
row. Importantly, there do not exist deterministic tools for sampling from unnormalized multivariate 
probability distributions. This is very limiting because the problem of sampling from unnormalized 
distributions is at the heart of the field of Bayesian inference and the probabilistic programming 
approach to artificial intelligence lfT7l l6l [T8lfTTIl . At the same time, despite great progress in Monte 
Carlo simulation, the celebrated Gibbs sampler continues to be one of the most widely-used algo- 
rithms. For, example it is the inference engine behind popular statistics packages ifTTl . several tools 
for text analysis |21|, and Boltzmann machines 121 [12|. The popularity of Gibbs stems from its 
simplicity of implementation and the fact that it is a very generic algorithm. 

Without any doubt, it would be remarkable if we could design generic deterministic Gibbs sam- 
plers with fast (theoretical and empirical) rates of convergence. In this paper, we take steps toward 
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achieving this goal by capitalizing on a recent idea for deterministic simulation known as herding. 
Herding ll24ll23l[T0l is a deterministic procedure for generating samples x e A" C W\ such that the 
empirical moments fi of the data are matched. The herding procedure, at iteration t, is as follows: 

x(*) = argmax(w^*^-^\ 0(x)) 

w(*) = w(*-i) 0(x(*)), (1) 

where cp : X ^ His a feature map (statistic) from A" to a Hilbert space H with inner product (•,•), 
w e "H is the vector of parameters, and fi E H is the moment vector (expected value of over 
the data) that we want to match. If we choose normalized features by making ||(5^x)|| constant for 
all X, then the update to generate samples x*^*^ for t = l,2,...,Tin Equation 111 is equivalent to 
minimizing the objective 

1 ^ 

J(xi,...,xt)- /x--^0(x(*)) , (2) 

where T may have no prior known value and || • || = \/{-,-) is the naturally defined norm based 
upon the inner product of the space 'H lEHD. 

Herding can be used to produce samples from normalized probability distributions. This is done 
as follows. Let /j, denote a discrete, normalized probability distribution, with /i^ e [0, 1] and 
Sr=i Mi = 1- A natural feature in this case is the vector 4>{x) that has all entries equal to zero, 
except for the entry at the position indicated by x. For instance, if x = 2 and n = 5, we have 
cf){x) = (0, 1, 0, 0, 0)^. Hence, p, = T-^ Y.t=i 4>[x^^^) 

is an empirical estimate of the distribution. 
In this case, one step of the herding algorithm involves finding the largest component of the weight 
vector {i* — argmax^gj]^ 2 n] '^1*^^')' setting x^*-' = i*, fixing the i*-entry of 4>{x'-*^) to one 
and all other entries to zero, and updating the weight vector: w^*^ — w^*"^) + (/x — 4>{x^^'>)). The 
output is a set of samples {x^^\ . . . ,x'^'^^ for which the empirical estimate fl converges on the 
target distribution fi as 0(1/T). 

The herding method, as described thus far, only applies to normalized distributions or to problems 
where the objective is not to guarantee that the samples come from the right target, but to ensure that 
some moments are matched. An interpretation of herding in terms of Bayesian quadrature has been 
put forward recently by OJJ. 

In this paper, we will show that it is possible to use herding to generate samples from more complex 
unnormalized probability distributions. In particular, we introduce a deterministic variant of the 
popular Gibbs sampling algorithm, which we refer to as herded Gibbs. While Gibbs relies on draw- 
ing samples from full-conditionals at random, herded Gibbs generates the samples by matching 
the full-conditionals. That is, one simply applies herding to all the full-conditional distributions. 

The experiments will demonstrate that the new algorithm outperforms Gibbs sampling and mean 
field methods in the domain of sampling from sparsely connected probabilistic graphical models, 
such as grid-lattice Markov random fields (MRFs) for image denoising and conditional random 
fields (CRFs) for natural language processing. 

We advance the theory by proving that the deterministic Gibbs algorithm converges for distributions 
of independent variables and fully-connected probabilistic graphical models. However, a proof es- 
tablishing suitable conditions that ensure convergence of herded Gibbs sampling for sparsely con- 
nected probabilistic graphical models is still unavailable. 



2 Herded Gibbs Sampling 

For a graph of discrete nodes Q — (V, E), where the set of nodes are the random variables V = 
Xi E X, let TT denote the target distribution defined on Q. 

Gibbs sampling is one of the most popular methods to draw samples from tt. Gibbs alternates (either 
systematically or randomly) the sampling of each variable Xi given X^(j) — 's.j\f{i), where i is the 
index of the node, and N{i) denotes the neighbors of node i. That is, Gibbs generates each sample 
from its full-conditional distribution p(Xi|xjv^(i)). 
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Algorithm 1 Herded Gibbs Sampling. 



Input: T. 

Step 1; Set t = 0. Initialize X*^*^^ in the support of tt and wf^.^^^,^ in {'n:{Xi = l|x^(i)) — 
l,7r(X, = l|x^(i))). 
for i = 1 ^ T do 

Step 2: Pick a node i according to some policy. Denote w — w^* (^Ij) . 

Step 3: If w > 0, set xf' — 1, otherwise set x'p ~ 0. 

Step 4: Update weight w^'^ = w'^'-H,, + it{X, = 1 jx^V/^^ ) - xf ^ 

Step 5: Keep the values of all the other nodes xj*^ — xj* ^\ Vj ^ i and all the other weights 

W,-' ' , VT Tt « OrXA/-(-,,-i 7^ Xj., ./. 

end for 

Output: x(i), . . . ,x(^) 



Herded Gibbs replaces the sampling from full-conditionals with herding at the level of the full- 
conditionals. That is, it alternates a process of matching the full-conditional distributions p(Xi = 
a;i|Xjv'(i)). To do this, herded Gibbs defines a set of auxiliary weights {wi.xy^(i)} for any value of 
Xi — Xi and Xjv'(i) = ~^Miyi)- For ease of presentation, we assume the domain of Xi is binary, 
X — {0; 1}, and we use one weight for every i and assignment to the neighbors y^}^(i). Herded 
Gibbs can be trivially generalized to the multivariate setting by employing weight vectors in M''*! 
instead of scalars. 

If the binary variable Xi has four binary neighbors Xjv'(i), we must maintain 2^ — 16 weight vectors. 
Only the weight vector corresponding to the current instantiation of the neighbors is updated, as 
illustrated in Algorithm[T] The memory complexity of herded Gibbs is exponential in the maximum 
node degree. Note the algorithm is a deterministic Markov process with state (X, W). 

The initialization in step 1 guarantees that X^*) always remains in the support of tt. For a deter- 
ministic scan policy in step 2, we take the value of variables x*^*^^ ,t G N as a sample sequence. 
Throughout the paper all experiments employ a fixed variable traversal for sample generation. We 
call one such traversal of the variables a sweep. 



3 Analysis 

As herded Gibbs sampling is a deterministic algorithm, there is no stationary probability distribution 
of states. Instead, we examine the average of the sample states over time and hypothesize that it 
converges to the joint distribution, our target distribution, tt. To make the treatment precise, we need 
the following definition: 

Definition 1. For a graph of discrete nodes Q — where the set of nodes V = {Xi}^-^^, 

(t) 

Xi e X, Prp is the empirical estimate of the joint distribution obtained by averaging over T 
samples acquired from Q. Pj, is derived from T samples, collected at the end of every sweep over 
N variables, starting from the t*'* sweep: 

p(^)(X = x) = i'f]"\(x(^-~)=x) (3) 



Our goal is to prove that the limiting average sample distribution over time converges to the target 
distribution tt. Specifically, we want to show the following: 

lim Plf'^ (x) = 7r(x),Vr > (4) 
If this holds, we also want to know what the convergence rate is. 
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We begin the theoretical analysis with a graph of one binary variable. For this graph, there is only 
one weight w. Denote Tr{X = 1) as vr for notational simplicity. The sequence of X is determined 
by the dynamics of w (shown in Figure[T]i: 

^ +n- > 0), = ( I '^f'!^ > (5) 

^ ' [0 otherwise 

Lemma [3] in the appendix shows that (tt — 1, tt] is the invariant interval of the dynamics, and the 
state X = 1 is visited at a frequency close to tt with an error: 

\p!f\x^l)~n\<^ (6) 

This is known as the fast moment matching property in ll24ll23l[T0ll . We will show in the next two 
theorems that the fast moment matching property also holds for two special types of graphs, with 
proofs provided in the appendix. 



x=0 x=1 




Figure 1: Herding dynamics for a single variable. 

In an empty graph, all the variables are independent of each other and herded Gibbs reduces to 
running N one-variable chains in parallel. Denote the marginal distribution tt; := Tr{Xi — 1). 

Examples of failing convergence in the presence of rational ratios between the tt^s were observed in 
||4]. There the need for further theoretical research on this matter was pointed out. The following 
theorem provides formal conditions for convergence in the restricted domain of empty graphs. 

Theorem 1. For an empty graph, when herded Gibbs has a fixed scanning order, and 

{l,7ri,...,7rjv} are rationally independent, the empirical distribution P^^^ converges to the tar- 
get distribution tt as T ^ 00 for any r > 0. 

A set of n real numbers, xi, 2:2, ■ • • , a^n, is said to be rationally independent if for any set of rational 
numbers, ai, 02, . . . , a„, we have X^ILi '^i^i = = 0, VI < i < n. The proof of Theoremjl] 
consists of first formulating the dynamics of the weight vector as a constant translation mapping m 
a circular unit cube, and then proving that the weights are uniformly distributed by making use of 
Kronecker-Weyl's theorem | 



For fully-connected (complete) graphs, convergence is guaranteed even with rational ratios. In fact, 
herded Gibbs converges to the target joint distribution at a rate of 0(1/T) with a 0(log(r)) burn-in 
period. This statement is formalized in Theorem|2] 

Theorem 2. For a fully-connected graph, when herded Gibbs has a fixed scanning order and a 
Dobrushin coefficient of the corresponding Gibbs sampler 77 < 1, there exist constants I > 0, and 
B > such that 

d,(p|"^-7r) < ^,vr>r*,T>r*(r) (7) 



where A = T* - ^, t*{T) = log_^ (^^^), and {Sir) := ^\\STr\ 



The constants I and B are defined in Equation 31 for Proposition|4]in the appendix. If we ignore the 



burn-in period and start collecting samples simply from the beginning, we achieve a convergence rate 



of 0(^^^^?^) as stated in Corollary 10 in the appendix. The constant I in the convergence rate has an 
exponential term, with N in the exponent. An exponentially large constant seems to be unavoidable 
for any sampling algorithm when considering the convergence to a joint distribution with 2^ states. 
As for the marginal distributions, it is obvious that the convergence rate of herded Gibbs is also 
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0(1/T) because marginal probabilities are linear functions of the joint distribution. However, in 
practice, we observe very rapid convergence results for the marginals, so stronger theoretical results 
about the convergence of the marginal distributions seem plausible. 

The proof proceeds by first bounding the discrepancy between the chain of empirical estimates of 
the joint obtained by averaging over T herded Gibbs samples, {p!j^^^}, s > t, and a Gibbs chain 
initialized at After one iteration, this discrepancy is bounded above by 0(l/r). 

The Gibbs chain has geometric convergence to tt and the distance between the Gibbs and herded 
Gibbs chains is bounded by 0{1/T). The geometric convergence rate to tt dominates the discrep- 

(t) 

ancy of herded Gibbs and thus we infer that Pj, converges to tt geometrically. To round-off the 
proof, we must find a limiting value for r. The proof concludes with an 0(log(T)) burn-in for r. 

However, for a generic graph we have no mathematical guarantees on the convergence rate of herded 
Gibbs. In fact, one can easily construct synthetic examples for which herded Gibbs does not seem 
to converge to the true marginals and joint distribution. For the examples covered by our theorems 
and for examples with real data, herded Gibbs demonstrates good behaviour The exact conditions 
under which herded Gibbs converges for sparsely connected graphs are still unknown. 



4 Experiments 



4.1 Simple Complete Graph 



We begin with an illustration of how herded Gibbs substantially outperforms Gibbs on a simple 
complete graph. In particular, we consider a fully-connected model of two variables, Xi and X2, 
as shown in Figure |2j the joint distribution of these variables is shown in Table [T] Figure [3] shows 
the marginal distribution P{Xi = 1) approximated by both Gibbs and herded Gibbs for different 
e. As e decreases, both approaches require more iterations to converge, but herded Gibbs clearly 
outperforms Gibbs. The figure also shows that Herding does indeed exhibit a linear convergence 
rate. 



^1 



X2 





Xi = 


Xi -1 


P(X2) 


X2 =0 


1/4 -e 


e 


1/4 


X2 = 1 




3/4- e 


3/4 


P(Xi) 


1/4 


3/4 


1 



Figure 2: Two-variable model. 



Table 1: Joint distribution of the two-variable model. 



4.2 MRF for Image Denoising 

Next, we consider the standard setting of a grid-lattice MRP for image denoising. Let us assume 
that we have a binary image corrupted by noise, and that we want to infer the original clean image. 
Let Xi £ { — 1, +1} denote the unknown true value of pixel i, and yi the observed, noise-corrupted 
value of this pixel. We take advantage of the fact that neighboring pixels are likely to have the same 
label by defining an MRF with an Ising prior. That is, we specify a rectangular 2D lattice with the 
following pair- wise clique potentials: 



il},,{x,,Xj) = [ ^j^^ ] (8) 



and joint distribution: 



(9) 



where i ^ j is used to indicate that nodes i and j are connected. The known parameters Jij establish 
the coupling strength between nodes i and j. Note that the matrix J is symmetric. If all the > 0, 
then neighboring pixels are likely to be in the same state. 
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0.5 1 1.5 2 2.5 0.5 1 1.5 2 2.5 0.5 1 1.5 2 2.5 0.5 1 1.5 2 2.5 

Iterations x lo^ Iterations x lo^ Iterations x lo^ Iterations x lo^ 

(a) Approximate marginals obtained via Gibbs (blue) and herded Gibbs (red). 





Iterations Iterations Iterations Iterations 

(b) Log-log plot of marginal approximation errors obtained via Gibbs (blue) and herded Gibbs (red). 




0.5 1 1.5 2 2.5 3 

Iterations x lo^ 



0.5 1 1.5 2 2.5 3 

Iterations xio^ 




0.5 1 1.5 2 2.5 3 

Iterations xio^ 



Iterations x lo 

(c) Inverse of marginal approximation errors obtained via Gibbs (blue) and herded Gibbs (red). 



Figure 3: (a) Approximating a marginal distribution with Gibbs (blue) and herded Gibbs (red) for an 
MRF of two variables, constructed so as to make the move from state (0, 0) to (1, 1) progressively 
more difficult as e decreases. The four columns, from left to right, are for e = 0.1, e = 0.01, 
e = 0.001 and e ~ 0.0001. Table [T] provides the joint distribution for these variables. The error 
bars for Gibbs correspond to one standard deviation. Rows (b) and (c) illustrate that the empirical 
convergence rate of herded Gibbs matches the expected theoretical rate. In the plots of rows (b) 
and (c), the upper-bound in the error of herded Gibbs was used to remove the oscillations so as to 
illustrate the behaviour of the algorithm more clearly. 



The MRF model combines the Ising prior with a likelihood model as follows: 



p(x,y) = p{x)p{y\x) = 



2 IT ^'''^ ' '''^ ' 



Ylp{yi\xi) 



(10) 



The potentials tp^j encourage label smoothness. The likelihood terms p{yi\xi) are conditionally 
independent (e.g. Gaussians with known variance and mean /i. centered at each value of Xi, 
denoted fix J- In more precise terms. 



p(x,y|J,/x,CT) = ^ ^ exp ^^ J^jX,XJ - ^J^iv 



(11) 
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Figure 4: Original image (left) and its conupted version (right), with noise parameter cr = 4. 
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II 



Iterations 



Figure 5: Reconstruction errors for the image denoising task. The results are averaged across 
10 corrupted images with Gaussian noise Af{0,16). The error bars correspond to one standard 
deviation. Mean field requires the specification of the damping factor D. 



When the coupling parameters are identical, say Jij = J, we have '}2,ij Jijfi^ii^j) — 
J^ij fixi^Xj). Hence, different neighbor configurations result in the same value of 
JJ2ij f{xi,Xj). If we store the conditionals for configurations with the same sum together, we 
only need to store as many conditionals as different possible values that the sum could take. This 
enables us to develop a shared version of herded Gibbs that is more memory efficient where we only 
maintain and update weights for distinct states of the Markov blanket of each variable. 

In this exemplary image denoising experiment, noisy versions of the binary image, seen in Figure [4] 
(left), were created through the addition of Gaussian noise, with varying a. Figure |4] (right) shows a 
corrupted image with a ~ 4. The L2 reconstruction errors as a function of the number of iterations, 
for this example, are shown in Figure |5] The plot compares the herded Gibbs method against Gibbs 
and two versions of mean field with different damping factors |fT9l . The results demonstrate that the 
herded Gibbs techiques are among the best methods for solving this task. 

A comparison for different values cr is presented in Table|2] As expected mean field does well in the 
low-noise scenario, but the performance of the shared version of herded Gibbs as the noise increases 
is significantly better. 
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Table 2: Errors of image denoising example after 30 iterations (all measurements have been scaled 
by X lO^'^). We use an Ising prior with Jij = 1 and four Gaussian noise models with different 
it's. For each a, we generated 10 corrupted images by adding Gaussian noise. The final results 
shown here are averages and standard deviations (in parentheses) across the 10 corrupted images. D 
denotes the damping factor in mean field. 





2 


4 


6 


8 


Herded Gibbs 
Herded Gibbs - shared 
Gibbs 

Mean field (D=0.5) 
Mean field (D=l) 


21.58(0.26) 

22.24(0.29) 

21.63(0.28) 

15.52(0.30) 

17.67(0.40) 


32.07(0.98) 

31.40(0.59) 

37.20(1.23) 

41.76(0.71) 

32.04(0.76) 


47.52(1.64) 

42.62(1.98) 

63.78(2.41) 

76.24(1.65) 

51.19(1.44) 


67.93(2.78) 

58.49(2.86) 

90.27(3.48) 

104.08(1.93) 

74.74(2.21) 




Figure 6: Typical skip-chain CRF model for named entity recognition. 



4.3 CRT for Named Entity Recognition 

Named Entity Recognition (NER) involves the identification of entities, such as people and loca- 
tions, within a text sample. A conditional random fied (CRF) for NER models the relationship 
between entity labels and sentences with a conditional probability distribution: P{Y\X, 9), where 
X is a sentence, y is a labeling, and 6* is a vector of coupling parameters. The parameters, 9, are fea- 
ture weights and model relationships between variables Yi and Xj or Yi and Yj. A chain CRF only 
employs relationships between adjacent variables, whereas a skip-chain CRF can employ relation- 
ships between variables where subscripts i and j differ dramatically. Skip-chain CRFs are important 
in language tasks, such as NER and semantic role labeling, because they allow us to model long 
dependencies in a stream of words, see Figure |6] 

Once the parameters have been learned, the CRF can be used for inference; a labeling for some 
sentence X is found by maximizing the above probability. Inference for CRF models in the NER 
domain is typically canied out with the Viterbi algorithm. However, if we want to accommodate long 
term dependencies, thus resulting in the so called skip-chain CRFs, Viterbi becomes prohibitively 
expensive. To surmount this problem, the Stanford named entity recognizer l,15J makes use of 
annealed Gibbs sampling. 

To demonstrate herded Gibbs on a practical application of great interest in text mining, we modify 
the standard inference procedure of the Stanford named entity recognizer by replacing the annealed 
Gibbs sampler with the herded Gibbs sampler. The herded Gibbs sampler in not annealed. To find 
the maximum a posteriori sequence Y, we simply choose the sample with highest joint discrete 
probability. In order to be able to compare against Viterbi, we have purposely chosen to use single- 
chain CRFs. We remind the reader, however, that the herded Gibbs algorithm could be used in cases 
where Viterbi inference is not possible. 

We used the pre-trained 3-class CRF model in the Stanford NER package ifTSl . This model is a 
linear chain CRF with pre-defined features and pre-trained feature weights, 9. For the test set, we 
used the corpus for the NIST 1999 lE-ER Evaluation. Performance is measured in per-entity Fi 

(p 2 - P'"""""";'""'''' ') . For all the methods, except Viterbi, we show Fi scores after 100, 400 and 

\ ^ precision+recall / i x- i j. ^ 

800 iterations in Table [3] For Gibbs, the results shown are the averages and standard deviations 
over 5 random runs. We used a linear annealing schedule for Gibbs. As the results illustrate. 
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"Pumpkin" {Tim Roth) and "Honey Bunny" {Amanda Plummer) are having breakfast in a 
diner. They decide to rob it after realizing they could make money off the customers as 
well as the business, as they did during their previous heist. Moments after they initiate 
the hold-up, the scene breaks off and the title credits roll. As Jules Winiifield {Samuel L. 
Jackson) drives, Vincent Vega {John Travolta) talks about his experiences in Europe, from 
where he has just returned: the hash bars in Amsterdam, the French McDonald's and its 
"Royale with Cheese". 

Figure 7: Results for the application of the NER CRF to a random Wikipedia sample lH]. Entities 
are automatically classified as Person, Location and Organization. 



herded Gibbs attains the same accuracy as Viterbi and it is faster than aimealed Gibbs. Unlike 
Viterbi, herded Gibbs can be easily applied to skip-chain CRFs. After only 400 iterations (90.5 
seconds), herded Gibbs already achieves an Fi score of 84.75, while Gibbs, even after 800 iterations 
(115.9 seconds) only achieves an Fi score of 84.61. The experiment thus clearly demonstrates that 
(i) herded Gibbs does no worse than the optimal solution, Viterbi, and (ii) herded Gibbs yields 
more accurate results for the same amount of computation. Figure |7]provides a representative NER 
example of the performance of Gibbs, herded Gibbs and Viterbi (all methods produced the same 
annotation for this short example). 

Table 3: Gibbs, herded Gibbs and Viterbi for the NER task. The average computational time each 
approach took to do inference for the entire test set is listed (in square brackets). After only 400 
iterations (90.48 seconds), herded Gibbs already achieves an Fi score of 84.75, while Gibbs, even 
after 800 iterations (1 15.92 seconds) only achieves an Fi score of 84.61. For the same computation, 
herded Gibbs is more accurate than Gibbs. 



^ ' Iterations 

Method ^ — 


100 


400 


800 


Annealed Gibbs 


84.36(0.16) [55.73s] 


84.51(0.10) [83.49s] 


84.61(0.05) [115.92s] 


Herded Gibbs 


84.70 [59.08s] 


84.75 [90.48s] 


84.81 [132.00s] 


Viterbi 






84.81[46.74s] 



5 Conclusions and Future Work 

In this paper, we introduced herded Gibbs, a deterministic variant of the popular Gibbs sampling al- 
gorithm. While Gibbs relies on drawing samples from the full-conditionals at random, herded Gibbs 
generates the samples by matching the full-conditionals. Importantly, the herded Gibbs algorithm is 
very close to the Gibbs algorithm and hence retains its simplicity of implementation. 

The synthetic, denoising and named entity recognition experiments provided evidence that herded 
Gibbs outperforms Gibbs sampling. However, as discussed, herded Gibbs requires storage of the 
conditional distributions for all instantiations of the neighbors in the worst case. This storage re- 
quirement indicates that it is more suitable for sparse probabilistic graphical models, such as the 
CRFs used in information extraction. At the other extreme, the paper advanced the theory of de- 
terministic sampling by showing that herded Gibbs converges with rate 0(1/T) for models with 
independent variables and fully-connected models. Thus, there is gap between theory and practice 
that needs to be narrowed. We do not anticipate that this will be an easy task, but it is certainly a key 
direction for future work. 

We should mention that it is also possible to design parallel versions of herded Gibbs in a Jacobi 
fashion. We have indeed studied this and found that these are less efficient than the Gauss-Seidel 
version of herded Gibbs discussed in this paper. However, if many cores are available, we strongly 
recommend the Jacobi (asynchronous) implementation as it will likely outperform the Gauss-Seidel 
(synchronous) implementation. 
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The design of efficient herding algorithms for densely connected probabilistic graphical models 
remains an important area for future research. Such algorithms, in conjunction with Rao Black- 
wellization, would enable us to attack many statistical inference tasks, including Bayesian variable 
selection and Dirichlet processes. 

There are also interesting connections with other algorithms to explore. If, for a fully connected 
graphical model, we build a new graph where every state is a node and directed connections exist 
between nodes that can be reached with a single herded Gibbs update, then herded Gibbs becomes 
equivalent to the Rotor- Router model of Alex Hokoyd and Jim Propf[^ lfT3]| . This deterministic ana- 
logue of a random walk has provably superior concentration rates for quantities such as normalized 
hitting frequencies, hitting times and occupation frequencies. In line with our own convergence 
results, it is shown that discrepancies in these quantities decrease as 0(1/T) instead of the usual 
0{1/\/T). We expect that many of the results from this literature apply to herded Gibbs as well. 
The connection with the work of Art Owen and colleagues, see for example |7|, also needs to 
be explored further Their work uses completely uniformly distributed (CUD) sequences to drive 
Markov chain Monte Carlo schemes. It is not clear, following discussions with Art Owen, that CUD 
sequences can be constructed in a greedy way as in herding. 
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A Proof of Theorem [T] 



We first show that the weight dynamics of a one-variable herding algorithm are restricted to an 
invariant interval of length 1. 

Lemma 3. Ifw is the weight of the herding dynamics of a single binary variable X with probability 

P{X = 1) = TT, andw^^^ G (tt — 1, tt] af some step s > 0, then w^^^ G (tt — 1, tt] , Vi > s. Moreover, 
far T Cz N, we have: 

s+T 



= 1] e [Ttt - 1, Ttt + 1] (12) 

t=s+l 
S+T 

J2 = 0] e [T(l - tt) - 1, r(i - tt) + 1]. (13) 

t=s+l 

Proof. We first show that w e (tt — 1, tt], Vt > s. This is easy to observe by induction as w'^-' G 
(tt — 1, tt] and if w'^*' G (tt — 1, tt] for some t > s, then, following Equationpl we have: 

^(t+i) ^ r u;(*)+7r-lG (7r-l,2^-l] C (^-1,^] ifu;(*)>0 
|_ w'^*' + TT G (27r — 1, tt] C (tt — 1, tt] otherwise. 



Summing up both sides of Equation[5]over t immediately gives us the result of Equation 12 



smce: 



Ttt- J2 = 1] = - w<-'^ G [-1, 1]. (15) 



t=s+l 



In addition, Equation [l3| follows by observing that 0] = 1 - = 1]. □ 

When w is outside the invariant interval, it is easy to observe that w will move into it monotonically 
at a Unear speed in a transient period. So we will always consider an initialization of w G (tt — 1, vr] 
from now on. 

Equivalently, we can take a one-to-one mapping w -s— w mod 1 (we define 1 mod 1 = 1) and 
think of w as updated by a constant translation vector in a circular unit interval (0, 1] as shown in 
Figure |8] That is, 

= + n) mod 1, = ( I < (16) 

^ ' othei-wise 







1 



w 




I xi=1pi Ai=0 1 



w 



Figure 8: Equivalent weight dynamics for 
a single variable. 



Figure 9: Dynamics of herding with two indepen- 
dent variables. 



We are now ready to give the proof of Theorem [T] 
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Proof of Theorem^ For an empty graph of N independent vertices, the dynamics of the weight 
vector w are equivalent to a constant translation mapping in an A'^-dimensional circular unit space 
(0, 1], as shown in Figure[9j 



^ (w(*-i)+7r) modi 
= (w(°'+i7r) modi, 



(t) 



1 if W] ' < TTi 

otherwise 



,yi<i<N (17) 



The Kronecker-Weyl theorem 125 1 states that the sequence w'*' = tir mod 1, i G Z+ is equidis- 
tributed (or uniformly distributed) on (0, 1] if and only if (l,7ri, . . . jTr^r) is rationally indepen- 
dent. Since we can define a one-to-one volume preserving transformation between w*^*^ and w*^*^ 
as (w^'^ + w(°)) mod 1 — w'^*\ the sequence of weights {w^*'} is also uniformly distributed in 
(0,1^. 



Define the mapping from a state value Xi to an interval of Wi as 



A,{x) 



_/ (0,^,1 ifx = l 
{jTi, 1] if x = 



(18) 



and let \ Ai \ be its measure. We obtain the limiting distribution of the joint state as 



lim 4^'(X = x) - lim 



N 



N 



4=1 

N 



W_i:{X^ = Xi) 



i=l 



= 7r(X = x) 



(19) 

□ 
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B Proof of Theorem in 



In this appendix, we give an upper bound for the convergence rate of the sampling distribution in 
fully connected graphs. As herded Gibbs sampling is deterministic, the distribution of a variable's 
state at every iteration degenerates to a single state. As such, we study here the empirical distribution 
of a collection of samples. 

The structure of the proof is as follows (with notation defined in the next subsection): We study the 
distribution distance between the invariant distribution tt and the empirical distribution of T samples 

(r) 

collected starting from sweep r, Pj. . We show that the distance decreases as t t + 1 with the 
help of an auxiliary regular Gibbs sampling Markov chain initialized at tt'"^ — P!f\ as shown in 
Figure [To] On the one hand, the distance between the regular Gibbs chain after one iteration, t:'^^\ 
and TT decreases according to the geometric convergence property of MCMC algorithms on compact 

state spaces. On the other hand, we show that in one step the distance between p!f^^^ and vr'^' 
increases by at most 0(1 /T). Since the 0{1/T) distance term dominates the exponentially small 

distance term, the distance between P^^^^'' and tt is bounded by 0(1/T). Moreover, after a short 

burn-in period, L = 0(log(T)), the empirical distribution p!f^^^ will have an approximation error 
intheorderof 0(1/T). 

Pf =jr^«>^L^ Herded Gibbs 




^''^ Gibbs 

Distr. Distance 



^ — ^ ^ Invariant Distr. 

Figure 10: Transition kernels and relevant distances for the proof of Theoremp^ 



B.l Notation 

Assume without loss of generality that in the systematic scanning policy, the variables are sampled 
in the order 1, 2, • • • ,N. 

B.1.1 State Distribution 

• Denote by the support of the distribution tt, that is, the set of states with positive 
probability. 

• We use T to denote the time in terms of sweeps over all of the N variables, and t to denote 
the time in terms of steps where one step constitutes the updating of one variable. For 
example, at the end of r sweeps, we have t ~ tN. 

• Recall the sample/empirical distribution, P^p\ presented in Definition [T| Figure [TT| pro- 
vides a visual interpretation of the definition. 

• Denote the sample/empirical distribution at the z*'' step within a sweep as p!f^ , t > 0, T > 



0, < i < A^, as shown in Figure 12 



4:/(x=x) = i'f]"\(x('=^+^)=x). 



fc=r 
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,T+1 



Figure 11: Distribution over time at the end of every sweep. 



This is the distribution of T samples collected at the i step of every sweep, starting from 
the r*'' sweep. Clearly, p!f ^ = Pi^' - ^^^"^^ 



T,0 



Pr 



T,N 



t^tN tN+1 tN+2 

o — *■ o — > o - -> o 



(t + 1)A'(t + 1)^ + 1 {T + l)N + 2 



^ r.i ^ 7-.2 

Figure 12: Distribution over time within a sweep. 

• Denote the distribution of a regular Gibbs sampling Markov chain after L sweeps of updates 
over the N variables with tt^-^^ , i > 0. 

For a given time r, we construct a Gibbs Markov chain with initial distribution -k'^ — P^^ 
and the same scanning order of herded Gibbs, as shown in Figure [TO] 

B.1.2 Transition Kernel 

• Denote the transition kernel of regular Gibbs for the step of updating a variable Xi with 7i, 
and for a whole sweep with T. 

By definition, 7r°T = tt^. The transition kernel for a single step can be represented as a 
2^ X 2^ matrix: 

^(-' - { .(X, = ,.|x-.) ot?er;i'^"^ , 1 < ^ < iV, X, y e {0, 1}- (20) 

where x is the current state vector of iV variables, y is the state of the next step, and 
denotes all the components of x excluding the i*'* component. If 7r(x_i) = 0, the condi- 
tional probability is undefined and we set it with an arbitrary distribution. Consequently, T 
can also be represented as: 

• Denote the Dobrushin ergodic coefficient 131 of the regular Gibbs kernel with r\ e [0, 1]. 
When < 1, the regular Gibbs sampler has a geometric rate of convergence of 

d„(7r(i) - vr) = rf„(r7rW - tt) < ^d^(^^^'^^ - 7r),V7rW. (21) 
A common sufficient condition for 77 < 1 is that 7r(X) is strictly positive. 

• Consider the sequence of sample distributions P^^ , r = 0, 1, • • • in Figures [ll]and[l2] We 
define the transition kernel of herded Gibbs for the step of updating variable Xi with T!^} , 
and for a whole sweep with t!^^ ■ 

Unlike regular Gibbs, the transition kernel is not homogeneous. It depends on both the 
time T and the sample size T. Nevertheless, we can still represent the single step transition 
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kernel as a matrix: 



(22) 

where P^^{Xi = ?;i|x_i) is defined as: 

^Vden 

fe = T 

iVden = rP^ll(X-. = X_0 = I(xLf +*-'^ = X_0, (23) 

fe = T 

where iVnum is the number of occurrences of a joint state, and TVden is the number of oc- 
currences of a conditioning state in the previous step. When 7r(x_i) = 0, we know that 

~ (r) 

-^den = with a proper initialization of herded Gibbs, and we simply set 7^ / = Ti for 
these entries. It is not hard to verify the following identity by expanding every term with 
its definition 

^T,i — 'T,i 



pir+1) _ p(r)^{T) 

'Y ± rp I rp 

'T — 'T,l 'Ta ' ' ' 'T,N- 



and consequently, 
with 

B.2 Linear Visiting Rate 

We prove in this section that every joint state in the support of the target distribution is visited, at 
least, at a linear rate. This result will be used to measure the distance between the Gibbs and herded 
Gibbs transition kernels. 

Proposition 4. If a graph is fully connected, herded Gibbs sampling scans variables in a fixed order, 
and the corresponding Gibbs sampling Markov chain is irreducible, then for any state x G and 
any index z G [1, N], the state is visited at least at a linear rate. Specifically, 

31>0,B> 0,s.t.yi G [1, A^],x G X+,T e N,s G N 

s+T-l 



J2 I [X(*=^^-+') = X 

k—s 



>IT -B (24) 



Denote the minimum nonzero conditional probability as 

TTmin = min 7r(Xj|x_i). 

l<i<JV,7r(i;i|x_i)>0 

The following lemma, which is needed to prove Proposition [4j gives an inequality between the 
number of visits of two sets of states in consecutive steps. 

Lemma 5. For any integer i G [1, -/V] and two sets of states X, Y C with a mapping F :1L Y 
that satisfies the following condition: 

Vx G X, F(x)_^ = x_„ UxexF(x) = Y, (25) 

we have that, for any s > and T > 0, the number of times Y is visited in the set of steps 
Ci = {t = kN + i : s < k < k + T ~ 1} is lower bounded by a function of the number of times X 
is visited in the previous steps Ci-i = {t = kN + i— l:s<k<k + T— 1} as: 

J2 I [X^*) G y] > Tr„n„ I [X(*) G X] - |Y| (26) 
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Proof. As a complement to Condition 25 we can define ^ as the inverse mapping from Y to 
subsets of X so that for any y e Y, x G i^~^(y), we have x_i = y_j, and UygY-F'^^(y) = X. 

Consider any state y e Y, when y is visited in C,;, the weight Wi^y_. is active. Let us denote 
the set of all the steps in \sN + l,s(A^ + T) + TV] when 'Wi,y_. is active by Ci(y_i), that is, 

Ci(y_i) = {t:t€ Ci,xL*) = y- J. Applying Lemma [i] we get 



I [xW = y] > 7r(zj,|y_,)|C.(y_,)l ^ 1 > ^mm|C,(y_,)| - 1. 



(27) 



Since the variables X_i are not changed at steps in Ci, we have 



(28) 



Combining the fact that Uygy^ ^(y) = 
proves the lemma: 



and summing up both sides of Equation 27 over Y 



t€C, ySY \ teCi_i 



1 > TT, 



min / , 

tec,_: 



[x(* 



(29) 

□ 



Remark 6. A fully connected graph is a necessary condition for the application of Lemma^in the 
proof. If a graph is not fully connected (N(i) ^ —i), a weight Wi^yj^^.^ may be shared by multiple 
full conditioning states. In this case Ci(y_i) is no longer a consecutive sequence of times when the 
weight is updated, and Lemma^does not apply here. 

Now let us prove Proposition|4]by iteratively applying Lemma|5] 



Proof of Proposition^ Because the corresponding Gibbs sampler is irreducible and any Gibbs sam- 
pler is aperiodic, there exists a constant t* > such that for any state y £ and any step in a 
sweep, i, we can find a path of length t* for any state x e X+ with a positive transition probability, 
Path{x.) — {x — x(0), x(l), . . . , x(t*) = y), to connect from x to y, where each step of the path 
follows the Gibbs updating scheme. For a strictly positive distribution, the minimum value of t* is 
N. 

Denote r* — \t* /N~\ and the j*'' element of the path Path{yi) as Path{x, j). We can define t* + 1 
subsets Sj C < j < t* as the union of all the j"* states in the path from any state in 



By definition of these paths, we know 5*0 = and Sf = {y}, and there exits an integer and a 
mapping Fj : Sj-i Sj,yj that satisfy the condition in Lemma|5](i(j) is the index of the variable 
to be updated, and the mapping is defined by the transition path). Also notice that any state in Sj 
can be different from y by at most min{A^, t* — j} variables, and therefore \ Sj \ < 2"""^^'* 



17 



Let us apply Lemma |5]recursively from j — t* to 1 as 

s+T-l 

J2 l[x(*=™)=y 



s+T-l 

J2 I [X(*=^'=+') = y 



k—s 



> 



k—s~\-T 
s+T-l 



fc— s+r* 

s+T-l 



> ... 



fc— s+r* 



s+T-l 



fc— s+r* 



r-l 



3=0 



The proof is concluded by choosing the constants 



j=0 



(30) 



' — "mini ^ — ' "mm ^ "min^ 



(31) 



□ 



B.3 Herded Gibbs's Transition Kernel 7f is an Approximation to T 

The following proposition shows that 7^ is an approximation to the regular Gibbs sampler's tran- 
sition kernel T with an error of 0(1/T). 

Proposition 7. For a fully connected graph, if the herded Gibbs has a fixed scanning order and the 
corresponding Gibbs sampling Markov chain is irreducible, then for any t > 0, T > T* ^ 
where I and B are the constants in Proposition^ the following inequality holds: 

AN 

(32) 



IT 



Proof. When x ^ we have the equality '/^^■■'(x, y) = 7i(x, y) by definition. When x € X+ 
but y ^ then A'den = (see the notation of for definition of iVden) as y will never be 
visited and thus 7^^^(x,y) = = 7i(x,y) also holds. Let us consider the entries in 7^^^(x, y) 
with x, y e X+ in the following. 

Because X_i is not updated at step of every sweep, we can replace i — 1 in the definition of iVden 
by i and get 

iVde„ = 'E'l(X(^f+''Ux_,). 
k — T 

Notice that the set of times {t kN + i : t < k < r + r-l,X^j = ^-i)}^ whose size is A'den? is a 
consecutive set of times when Wi^x^i is updated. By Lemma[3j we obtain a bound for the numerator 

A^num e [iVden7r(Xj = y^\X-i) - l,iVden7r(X, = y^\x^^) + 1] ^ 



\P^^UX^ = - AX, = 2;,|x_,)| = 1^ - 7t{X, = y,\K_,)\ < — . (33) 

Also by Proposition |4] we know every state in X+ is visited at a linear rate, there hence exist 
constants I > and B > 0, such that the number of occurrence of any conditioning state x_i, A'den, 



,A^n, 
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is bounded by 

^den > ^x)>lT- B> -T, VT > (34) 



^"-^-^ ,™ 2B 



Combining equations ( 33 \ and ( 34 1, we obtain 

|pj,;:^(X, = y,|x_,)-^(X, = y,|x_,)|<-, Vr>— . (35) 

Since the matrix 'T^} and Ti differ only at those elements where x_, = y_j, we can bound the Li 
induced norm of the transposed matrix of their difference by 

y 

= max^ \p!:^J{Xi = y^jx.i) - ■k{X, = 2/i|x_j)| 

VT>?^ (36) 

Observing that both 7^^^ and T are multiplications of N component transition matrices, and the 
transition matrices, 7^^' and %, have a unit Li induced norm as: 



|(r£,')^||i = max^ \WJ{^,y)\ = ^k^Y.Pt1{X^ = 2/.|x_,) = 1 (37) 
y y 

II (7-)^ 111 = max^ |7;(x,y)| = max^P(X, = y,|x_0 = 1 (38) 



we can further bound the Li norm of the difference, (T^'^^ — TY' ■ Let P e be any vector with 
nonzero norm. Using the triangular inequality, the difference of the resulting vectors after applying 

Tf and T is bounded by 

— 11^ 'T.l 'T,2 • • ■ 'T.Af ^ '1 'T,2 ■ ' ' 'T,7Vlll^ 

II P'7-'7-('^)'7-(t) -f-C-r) P'TT'f'f'^) T't'^) II -I- 
ll^/l /7.2 'T,3 ■ ■ • 'T,N ^ ^ n 12 It,3 ' ■ • 'T,Ar||l + 

ll-PTi . . . - PTi . . . TAr-iTTvll 1 (39) 

where the i'th term is 

-(■^) _ 'r\'r('^) 'f-('^) IL ^ II pt: -rr . /"r(^) 



PTi . . . 7;-i(r^y - 7;)r^y+i . . . r^;^||i < IIPTi . . . 7;-i(r^y - 7;)||i (Unit norm, Eqn.|37 

4 

It 

IT 



<\\PTi...%^i\\i^ (Eqn.|3S 



4 

< 11^*11 1777; (Unit Li norm, Eqn.|38 



(40) 

Consequently, we get the Li induced norm as 

||(fr-rn|..n.x "^'f;-^'"- <g. VT>^, ,41, 

□ 
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B.4 Proof of Theorem|2] 



When we initialize the herded Gibbs and regular Gibbs with the same distribution (see Figure [T0|, 
since the transition kernel of herded Gibbs is an approximation to regular Gibbs and the distribution 
of regular Gibbs converges to the invariant distribution, we expect that herded Gibbs also approaches 
the invariant distribution. 



Proof of Theorem^ Construct an auxiliary regular Gibbs sampling Markov chain initialized with 

7r'^'^'(X) = p!f \^) and the same scanning order as herded Gibbs. As r/ < 1, the Gibbs Markov 
chain has uniform geometric convergence rate as shown in Equation pT] ). 

Also, the Gibbs Markov chain must be irreducible due to 77 < 1 and therefore Proposition |7] applies 
here. We can bound the distance between the distributions of herded Gibbs after one sweep of all 



variables, p!f!'^^\ and the distribution after one sweep of regular Gibbs sampling, tt^^-* by 

d.(p(^+^' - 7r(i)) = d„(^(°)(r^^) - r)) = Ih^'Kf^^^ -T)h 



T 

< f ll^<»'ll. 



27V 

7t ' 



VT > T*,T > 0. 



(42) 



(r) 

Now we study the change of discrepancy between Pj, ' and tt as a function as r. 
Applying the triangle inequality of : 

yir+l) 
T 

2N 

w 

The last inequality follows Equations ( [21} and ( [42| . When the sample distribution is outside a 
neighborhood of tt, B^-^ (tt), with ei 



d,{P^^^'' - tt) = - 7r(i) + ^(1) - ^) < - ^(1)) + d„(7r(i) ~ tt) 

< — +?7d„(P^^^ -tt), VT>r*,T>0. 



(43) 



il-v)lT 



I.e. 



dy{P!f^ ~7t) > 



4N 



(44) 



we get a geometric convergence rate toward the invariant distribution by combining the two equa- 
tions above: 

d„(p|"+^^ -7r)< l^d,(P^") - tt) + r;d„(P^") -tt)^ li^d^(pM _ (45^ 

So starting from r = 0, we have a bum-in period for herded Gibbs to enter B^^ (tt) in a finite number 
of rounds. Denote the first time it enters the neighborhood by r'. According to the geometric 



convergence rate in Equations 45 and dy (P^ 



(0) 



tt) < 1 



r' < 



l0gl±27 (- 



dy{P 



(0) 



< 



logi±:z(ei) =\r*{T)^. 



(46) 



After that burn-in period, the herded Gibbs sampler will stay within a smaller neighborhood, B^^ (tt) 
with., = l±|ff,i.e. 



\ — rj 11 



(47) 



This is proved by induction: 



1. Equation holds at r = r' + 1. This is because pip ^ G B^^ (vr) and following Eqn. 
we get 



43 



4(p(^'+i)-^)<^+r?6i=e2 

(r-l) 



2. For any r > t' + 2, assume Pj. ' G B^^iii). Since £2 < ei, P^ 



(48) 



is also in the ball 



(tt). We can apply the same computation as when t = r' + 1 to prove d^iPj. — tt) < 
£2. So inequality ( pT] ) is always satisfied by induction. 
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Consequently, Theorem |2] is proved when combining (47i with the inequality r' < \t*{T)~\ in 
Equation([46]l. □ 



Remark 8. Similarly to the regular Gibbs sampler, the herded Gibbs sampler also has a burn-in 
period with geometric convergence rate. After that, the distribution discrepancy is in the order of 
0{1/T), which is faster than the regular Gibbs sampler Notice that the length of the burn-in period 
depends on T, specifically as a function of\og{T). 

Remark 9. Irrationality is not required to prove the convergence on a fully-connected graph. 

Corollary 10. When the conditions of Theorem^hold, and we start collecting samples at the end 
of every sweep from the beginning, the error of the sample distribution is bounded by: 



d„(4^-)-.)<A±lS = 0(i^^), Vr>T*+r*(T*) (49) 



Proof. Since r* (T) is a monotonically increasing function of T, for any T > T* + t* (T* ), we can 
find a number t so that 

T ^t + T*{t),t > T*. 

Partition the sample sequence Sq^t = {X^'^^' : < fc < T} into two parts: the burn-in period 
S'o,r»(t) and the stable period Sr'{t),T- The discrepancy in the burn-in period is bounded by 1 and 
according to Theorem|2] the discrepancy in the stable period is bounded by 

d,{P{St,T) - it) < ^. 
Hence, the discrepancy of the whole set Sq^t is bounded by 

< (^^(P(5o,..(t)) - ^) + (^|(P(^..(t),T) - ^ 



< 



^ dy{P{SQ^r*{t)) - tt) + -d^{P{Sr*{t).T) - tt) 



<I^.l + l^<-^(^) + \ (50) 
- T T t - T ' 

□ 
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